Abstract

We consider optimization of generalized performance metrics for binary classification by means of 
surrogate losses. We focus on a class of metrics which are linear-fractional functions of the false
positive and false negative rates (examples of which include the $F_\beta$-measure, Jaccard similarity coefficient, AM
measure, and many others). Our analysis concerns the following two-step procedure. First, a real-valued
function $f$ is learned by minimizing a surrogate loss for binary classification on the training sample. It is
assumed that the surrogate loss is a strongly proper composite loss function (examples of which include
logistic loss, squared-error loss, exponential loss, etc.). Then, given $f$, a threshold $\theta$ is tuned on a separate
validation sample, by direct optimization of the target performance metric. We show that the regret of
the resulting classifier (obtained from thresholding $f$ on $\theta$) measured with respect to the target metric is
upperbounded by the regret of $f$ measured with respect to the surrogate loss. We also extend our results
to cover multilabel classification and provide regret bounds for micro- and macro-averaging measures. 
Our findings are further analyzed in a computational study on both synthetic and real data sets. 


1 Introduction 

In binary classification, misclassification error is not necessarily an adequate evaluation metric, and one often 
resorts to more complex metrics, better suited for the problem. For instance, when classes are imbalanced, 
$F_\beta$-measure (Lewis, 1995; Jansche, 2005; Nan et al, 2012) and AM measure (balanced error rate) (Menon
et al, 2013) are frequently used. Optimizing such generalized performance metrics poses computational and
statistical challenges, as they cannot be decomposed into losses on individual observations. 

In this paper, we consider optimization of generalized performance metrics by means of surrogate losses. 
We restrict our attention to a family of performance metrics which are ratios of linear functions of false 
positives (FP) and false negatives (FN). Such functions are called linear-fractional, and include the
aforementioned $F_\beta$ and AM measures, as well as Jaccard similarity coefficient, weighted accuracy, and many others
(Koyejo et al, 2014, 2015). We focus on the most popular approach to optimizing generalized performance
metrics in practice, based on the following two-step procedure. First, a real-valued function $f$ is learned
by minimizing a surrogate loss for binary classification on the training sample. Then, given $f$, a threshold
$\theta$ is tuned on a separate validation sample, by direct optimization of the target performance measure with
respect to a classifier obtained from $f$ by thresholding at $\theta$, classifying all observations with value of $f$ above
the threshold as positive class, and all observations below the threshold as negative class. This approach can
be motivated by the asymptotic analysis: minimization of appropriate surrogate loss results in estimation 
of conditional (“posterior”) class probabilities, and many performance metrics are maximized by a classifier 
which predicts by thresholding on the scale of conditional probabilities (Nan et al, 2012; Zhao et al, 2013;
Koyejo et al, 2014). However, it is unclear what can be said about the behavior of this procedure on finite
samples.

Dembczyński has been supported by the Polish National Science Centre under grant no. 2013/09/D/ST6/03917.

In this work, we are interested in theoretical analysis and justification of this approach for any sample 
size, and for any, not necessarily perfect, classification function. To this end, we use the notion of regret with 
respect to some evaluation metric, which is a difference between the performance of a given classifier and the 
performance of the optimal classifier with respect to this metric. We show that the regret of the resulting 
classifier (obtained from thresholding $f$ on $\theta$) measured with respect to the target metric is upperbounded by
the regret of $f$ measured with respect to the surrogate loss. Our result holds for any surrogate loss function
which is strongly proper composite (Agarwal, 2014), examples of which include logistic loss, squared-error loss,
exponential loss, etc. Interestingly, the proof of our result goes by an intermediate bound of the regret with
respect to the target measure by a cost-sensitive classification regret. As a byproduct, we get a bound on the
cost-sensitive classification regret by a surrogate regret of a real-valued function which holds simultaneously
for all misclassification costs: the misclassification costs only influence the threshold, but not the function,
the surrogate loss, or the regret bound.

We further extend our results to cover multilabel classification, in which the goal is to simultaneously 
predict multiple labels for each object. We consider two methods of generalizing binary classification
performance metrics to the multilabel setting: the macro-averaging and the micro-averaging (Manning et al, 2008;
Parambath et al, 2014; Koyejo et al, 2015). The macro-averaging is based on first computing the performance
metric separately for each label, and then averaging the metrics over the labels. In the micro-averaging, the 
false positives and false negatives for each label are first averaged over the labels, and then the performance 
metric is calculated on these averaged quantities. We show that our regret bounds hold for both macro- and 
micro-averaging measures. Interestingly, for micro averaging, only a single threshold needs to be tuned and 
is shared among all labels. 

Our finding is further analyzed in a computational study on both synthetic and real data sets. We 
compare the performance of the algorithm when used with two types of surrogate losses: the logistic loss 
(which is strongly proper) and the hinge loss (which is not a proper loss). On synthetic data sets, we analyze 
the behavior of the algorithm for discrete feature distribution (where nonparametric classifiers are used), 
and for continuous feature distribution (where linear classifiers are used). Next, we look at the performance 
of the algorithm on the real-life benchmark data sets, both for binary and multilabel classification. 

We note that the goal of this paper is not to propose a new learning algorithm, but rather to provide a 
deeper statistical understanding of an existing method. The two-step procedure analyzed here (also known 
as the plug-in method in the case when the outcomes of the function have a probabilistic interpretation), 
is commonly used in the binary classification with generalized performance metrics, but this is exactly the 
reason why we think it is important to study this method in more depth from a theoretical point of view. 

1.1 Related work 

In machine learning, numerous attempts to optimize generalized performance metrics have been proposed. 
They can be divided into two general categories. The structured loss approaches (Musicant et al, 2003;
Tsochantaridis et al, 2005; Petterson and Caetano, 2011, 2010) rely on incorporating the performance metric
into the training process, thus requiring specialized learning algorithms to optimize non-standard objectives.
On the other hand, the plug-in approaches, which are very closely related to the topic of this work, are based
on obtaining reliable class conditional probability estimates by employing standard algorithms minimizing
some surrogate loss for binary classification (such as logistic loss used in logistic regression, exponential loss
used in boosting, etc.), and then plugging these estimates into the functional form of the optimal prediction
rule for a given performance metric (Jansche, 2007; Nan et al, 2012; Dembczyński et al, 2013; Waegeman
et al, 2013; Narasimhan et al, 2014, 2015; Koyejo et al, 2014, 2015).

Existing theoretical work on generalized performance metrics is mainly concerned with statistical
consistency, also known as calibration, which determines whether convergence to the minimizer of a surrogate loss
implies convergence to the minimizer of the task performance measure as the sample size goes to infinity
(Dembczyński et al, 2010; Nan et al, 2012; Gao and Zhou, 2013; Zhao et al, 2013; Narasimhan et al, 2014;
Koyejo et al, 2014, 2015). Here we give a stronger result which bounds the regret with respect to the
performance metric by the regret with respect to the surrogate loss. Our result is valid for all finite sample sizes
and informs about the rates of convergence.

We also note that two distinct frameworks are used to study the statistical consistency of classifiers with 
respect to performance metrics: Decision Theoretic Analysis (DTA), which assumes a test set of a fixed size, 
and Empirical Utility Maximization (EUM), in which the metric is defined by means of population quantities 
(Nan et al, 2012). In this context, our work falls into the EUM framework. 

Parambath et al (2014) presented an alternative approach to maximizing linear-fractional metrics by 
learning a sequence of binary classification problems with varying misclassification costs. While we were 
inspired by their theoretical analysis, their approach is, however, more complicated than the two-step
approach analyzed here, which requires solving an ordinary binary classification problem only once. Moreover,
as part of our proof, we show that by minimizing a strongly proper composite loss, we are implicitly
minimizing cost-sensitive classification error for any misclassification costs without any overhead. Hence, the
costs need not be known during learning, and can be determined later on a separate validation sample
by optimizing the threshold. Narasimhan et al (2015) developed a general framework for designing provably
consistent algorithms for complex multiclass performance measures. They relate the regret with respect to
the target metric to the conditional probability estimation error measured in terms of the $L_1$-metric. Their
algorithms rely on using accurate class conditional probability estimates and solving multiple cost-sensitive
multiclass classification problems.

The generalized performance metrics for binary classification are employed in the multilabel setting 
by means of one of the three averaging schemes (Waegeman et al, 2013; Parambath et al, 2014; Koyejo
et al, 2015): instance-averaging (averaging errors over the labels, averaging metric over the examples),
macro-averaging (averaging errors over the examples, averaging metric over the labels), and micro-averaging
(averaging errors over the examples and the labels). Koyejo et al (2015) characterize the optimal classifiers
for multilabel metrics and prove the consistency of the plug-in method. Our regret bounds for multilabel 
classification can be seen as a follow up on their work. 

1.2 Outline 

The paper is organized as follows. In Section 2 we introduce basic concepts, definitions and notation. The 
main result is presented in Section 3 and proved in Section 4. Section 5 extends our results to the multilabel 
setting. The theoretical contribution of the paper is complemented by computational experiments in Section 
6, prior to concluding with a summary in Section 7. 


2 Problem setting 

2.1 Binary classification 

In binary classification, the goal is, given an input (feature vector) $x \in X$, to accurately predict the output
(label) $y \in \{-1, 1\}$. We assume input-output pairs $(x, y)$ are generated i.i.d. according to $\Pr(x, y)$. A
classifier is a mapping $h\colon X \to \{-1, 1\}$. Given $h$, we define the following four quantities:

$$\mathrm{TP}(h) = \Pr(h(x) = 1 \wedge y = 1),$$
$$\mathrm{FP}(h) = \Pr(h(x) = 1 \wedge y = -1),$$
$$\mathrm{TN}(h) = \Pr(h(x) = -1 \wedge y = -1),$$
$$\mathrm{FN}(h) = \Pr(h(x) = -1 \wedge y = 1),$$

which are known as true positives, false positives, true negatives and false negatives, respectively. We also
denote $\Pr(y = 1)$ by $P$. Note that for any $h$, $\mathrm{FP}(h) + \mathrm{TN}(h) = \Pr(y = -1) = 1 - P$ and $\mathrm{TP}(h) + \mathrm{FN}(h) = P$,
so out of the four quantities above, only two are independent. In this paper, we use the convention to
parameterize all metrics by means of $\mathrm{FP}(h)$ and $\mathrm{FN}(h)$.
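
As a small illustration of these definitions, the sketch below computes the four quantities for a classifier on a finite input space and checks the two identities stated above. The toy distribution `joint` and the classifier `h` are hypothetical choices made only for this example.

```python
# Hypothetical toy distribution Pr(x, y) over three inputs and labels in {-1, +1}.
joint = {
    ("a", 1): 0.10, ("a", -1): 0.30,
    ("b", 1): 0.15, ("b", -1): 0.15,
    ("c", 1): 0.25, ("c", -1): 0.05,
}


def confusion(h, joint):
    """Return (TP, FP, TN, FN) of classifier h (dict: x -> -1 or +1) under joint."""
    tp = sum(p for (x, y), p in joint.items() if h[x] == 1 and y == 1)
    fp = sum(p for (x, y), p in joint.items() if h[x] == 1 and y == -1)
    tn = sum(p for (x, y), p in joint.items() if h[x] == -1 and y == -1)
    fn = sum(p for (x, y), p in joint.items() if h[x] == -1 and y == 1)
    return tp, fp, tn, fn


# An arbitrary classifier chosen for illustration.
h = {"a": -1, "b": 1, "c": 1}
tp, fp, tn, fn = confusion(h, joint)

# P = Pr(y = 1); only two of the four quantities are independent.
P = sum(p for (x, y), p in joint.items() if y == 1)
```

Since FP + TN = 1 - P and TP + FN = P, knowing (FP, FN) and P is enough to recover the remaining two quantities.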

metric               expression

Accuracy             $\mathrm{Acc} = 1 - \mathrm{FN} - \mathrm{FP}$
$F_\beta$-measure    $F_\beta = \frac{(1+\beta^2)(P - \mathrm{FN})}{(1+\beta^2)P - \mathrm{FN} + \mathrm{FP}}$
Jaccard similarity   $J = \frac{P - \mathrm{FN}}{P + \mathrm{FP}}$
AM measure           $\mathrm{AM} = \frac{2P(1-P) - P\,\mathrm{FP} - (1-P)\,\mathrm{FN}}{2P(1-P)}$
Weighted accuracy    $\mathrm{WA} = \frac{w_1(1-P) + w_2 P - w_1\mathrm{FP} - w_2\mathrm{FN}}{w_1(1-P) + w_2 P}$

Table 1: Some popular linear-fractional performance metrics expressed as functions of FN and FP. See
(Koyejo et al, 2014) for a more detailed description.


We call a two-argument function $\Psi = \Psi(\mathrm{FP}, \mathrm{FN})$ a (generalized) classification performance metric. Given
a classifier $h$, we define $\Psi(h) = \Psi(\mathrm{FP}(h), \mathrm{FN}(h))$. Throughout the paper we assume that $\Psi(\mathrm{FP}, \mathrm{FN})$ is
linear-fractional, i.e., is a ratio of linear functions:

$$\Psi(\mathrm{FP}, \mathrm{FN}) = \frac{a_0 + a_1 \mathrm{FP} + a_2 \mathrm{FN}}{b_0 + b_1 \mathrm{FP} + b_2 \mathrm{FN}},$$    (1)

where we allow coefficients $a_i, b_i$ to depend on the distribution $\Pr(x, y)$. Note that our convention to
parameterize the metric by means of (FP, FN) does not affect definition (1), because $\Psi$ can be reparameterized to
(FP, TN), (TP, FN), or (TP, TN), and will remain linear-fractional in all these parameterizations. We also
assume $\Psi(\mathrm{FP}, \mathrm{FN})$ is non-increasing in FP and FN, a property that is inherently possessed by virtually all
performance measures used in practice. Table 1 lists some popular examples of linear-fractional performance
metrics.
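
To make the linear-fractional form (1) concrete, here is a minimal sketch that builds a metric from its coefficient triples and instantiates it for several metrics of Table 1. The helper names (`linear_fractional`, `f_beta_coeffs`, and so on) are hypothetical, and the coefficients are read off the expressions in Table 1.

```python
def linear_fractional(a, b):
    """Build Psi(FP, FN) = (a0 + a1*FP + a2*FN) / (b0 + b1*FP + b2*FN)."""
    a0, a1, a2 = a
    b0, b1, b2 = b

    def psi(fp, fn):
        return (a0 + a1 * fp + a2 * fn) / (b0 + b1 * fp + b2 * fn)

    return psi


def accuracy_coeffs():
    # Acc = 1 - FN - FP
    return (1.0, -1.0, -1.0), (1.0, 0.0, 0.0)


def f_beta_coeffs(P, beta=1.0):
    # F_beta = (1 + beta^2)(P - FN) / ((1 + beta^2) P - FN + FP)
    k = 1.0 + beta ** 2
    return (k * P, 0.0, -k), (k * P, 1.0, -1.0)


def jaccard_coeffs(P):
    # J = (P - FN) / (P + FP)
    return (P, 0.0, -1.0), (P, 1.0, 0.0)


def am_coeffs(P):
    # AM = (2P(1-P) - P*FP - (1-P)*FN) / (2P(1-P))
    d = 2.0 * P * (1.0 - P)
    return (d, -P, -(1.0 - P)), (d, 0.0, 0.0)


# Illustrative values: P = Pr(y = 1) = 0.5, FP = 0.2, FN = 0.1 (so TP = 0.4, TN = 0.3).
P, fp, fn = 0.5, 0.2, 0.1
acc = linear_fractional(*accuracy_coeffs())(fp, fn)
f1 = linear_fractional(*f_beta_coeffs(P))(fp, fn)
jac = linear_fractional(*jaccard_coeffs(P))(fp, fn)
am = linear_fractional(*am_coeffs(P))(fp, fn)
```

The coefficients depend on the distribution only through P, which is what definition (1) permits.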

Let $h^*_\Psi$ be the maximizer of $\Psi(h)$ over all classifiers:

$$h^*_\Psi = \arg\max_{h\colon X \to \{-1,1\}} \Psi(h)$$

(if argmax is not unique, we take $h^*_\Psi$ to be any maximizer of $\Psi$). Given any classifier $h$, we define its $\Psi$-regret
as:

$$\mathrm{Reg}_\Psi(h) = \Psi(h^*_\Psi) - \Psi(h).$$

The $\Psi$-regret is nonnegative from the definition, and quantifies the suboptimality of $h$, i.e., how much worse
$h$ is compared to the optimal $h^*_\Psi$.


2.2 Strongly proper composite losses 

Here we briefly outline the theory of strongly proper composite loss functions. See (Agarwal, 2014) for a 
more detailed description. 

Define a binary class probability estimation (CPE) loss function (Reid and Williamson, 2010, 2011) as a
function $c\colon \{-1, 1\} \times [0, 1] \to \mathbb{R}_+$, where $c(y, \hat\eta)$ assigns penalty to prediction $\hat\eta$, when the observed label is
$y$. Define the conditional $c$-risk as:^1

$$\mathrm{risk}_c(\eta, \hat\eta) = \eta\, c(1, \hat\eta) + (1 - \eta)\, c(-1, \hat\eta),$$

the expected loss of prediction $\hat\eta$ when the label is drawn from a distribution with $\Pr(y = 1) = \eta$. We
say a CPE loss is proper if for any $\eta \in [0, 1]$, $\eta \in \arg\min_{\hat\eta \in [0,1]} \mathrm{risk}_c(\eta, \hat\eta)$. In other words, proper losses are
minimized by taking the true class probability distribution as a prediction; hence $\hat\eta$ can be interpreted as a
probability estimate of $\eta$. Define the conditional $c$-regret as:

$$\mathrm{reg}_c(\eta, \hat\eta) = \mathrm{risk}_c(\eta, \hat\eta) - \inf_{\hat\eta'} \mathrm{risk}_c(\eta, \hat\eta') = \mathrm{risk}_c(\eta, \hat\eta) - \mathrm{risk}_c(\eta, \eta),$$

the difference between the conditional $c$-risk of $\hat\eta$ and the optimal $c$-risk. We say a CPE loss $c$ is $\lambda$-strongly
proper if for any $\eta, \hat\eta$:

$$\mathrm{reg}_c(\eta, \hat\eta) \ge \frac{\lambda}{2} (\eta - \hat\eta)^2,$$

i.e. the conditional $c$-regret is everywhere lowerbounded by a squared difference of its arguments. It can be
shown (Agarwal, 2014) that under a mild regularity assumption a proper CPE loss $c$ is $\lambda$-strongly proper if
and only if the function $H_c(\eta) := \mathrm{risk}_c(\eta, \eta)$ is $\lambda$-strongly concave. This fact lets us easily verify whether a
given loss function is $\lambda$-strongly proper.

^1 Throughout the paper, we follow the convention that all conditional quantities are lowercase (regret, risk), while all
unconditional quantities are uppercase (Regret, Risk).

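It is often more convenient to reparameterize the loss function from $\hat\eta \in [0, 1]$ to a real-valued $f \in \mathbb{R}$
through a strictly increasing (and therefore invertible) link function $\psi\colon [0, 1] \to \mathbb{R}$:

$$\ell(y, f) = c(y, \psi^{-1}(f)).$$

If $c$ is $\lambda$-strongly proper, we call the function $\ell\colon \{-1, 1\} \times \mathbb{R} \to \mathbb{R}_+$ a $\lambda$-strongly proper composite loss function.
The notions of conditional $\ell$-risk $\mathrm{risk}_\ell(\eta, f)$ and conditional $\ell$-regret $\mathrm{reg}_\ell(\eta, f)$ extend naturally to the case
of composite losses:

$$\mathrm{risk}_\ell(\eta, f) = \eta\, \ell(1, f) + (1 - \eta)\, \ell(-1, f),$$
$$\mathrm{reg}_\ell(\eta, f) = \mathrm{risk}_\ell(\eta, f) - \inf_{f'} \mathrm{risk}_\ell(\eta, f') = \mathrm{risk}_\ell(\eta, f) - \mathrm{risk}_\ell(\eta, \psi(\eta)),$$

and the strong properness of the underlying CPE loss implies:

$$\mathrm{reg}_\ell(\eta, f) \ge \frac{\lambda}{2} \left(\eta - \psi^{-1}(f)\right)^2.$$    (2)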

As an example, consider a logarithmic scoring rule:

$$c(y, \hat\eta) = -[\![y = 1]\!] \log \hat\eta - [\![y = -1]\!] \log(1 - \hat\eta),$$

where $[\![Q]\!]$ is the indicator function, equal to 1 if $Q$ holds, and to 0 otherwise. Its conditional risk is given
by:

$$\mathrm{risk}_c(\eta, \hat\eta) = -\eta \log \hat\eta - (1 - \eta) \log(1 - \hat\eta),$$

the cross-entropy between $\eta$ and $\hat\eta$. The conditional $c$-regret is the binary Kullback-Leibler divergence between
$\eta$ and $\hat\eta$:

$$\mathrm{reg}_c(\eta, \hat\eta) = \eta \log \frac{\eta}{\hat\eta} + (1 - \eta) \log \frac{1 - \eta}{1 - \hat\eta}.$$

Note that since $H_c(\eta) = \mathrm{risk}_c(\eta, \eta)$ is the binary entropy function, and $\left|\frac{d^2 H_c}{d\eta^2}\right| = \frac{1}{\eta(1-\eta)} \ge 4$, $c$ is a 4-strongly
proper loss. Using the logit link function $\psi(\hat\eta) = \log \frac{\hat\eta}{1 - \hat\eta}$, we end up with the logistic loss function:

$$\ell(y, f) = \log\left(1 + e^{-yf}\right),$$

which is 4-strongly proper composite from the definition.

Table 2 presents some of the commonly used losses which are strongly proper composite. Note that the
hinge loss $\ell(y, f) = (1 - yf)_+$, used, e.g., in support vector machines (Hastie et al, 2009), is not strongly
proper composite (it is not even proper composite).
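
The 4-strong properness of the logarithmic scoring rule can be checked numerically. The sketch below is only a sanity check on a grid, not a proof: it evaluates the conditional regret (the binary KL divergence) and compares it with $(\lambda/2)(\eta - \hat\eta)^2$ for $\lambda = 4$. The grid resolution is an arbitrary choice.

```python
import math

LAMBDA = 4.0  # strong properness constant of the log loss


def risk_log(eta, eta_hat):
    """Conditional risk of the logarithmic scoring rule (cross-entropy)."""
    return -eta * math.log(eta_hat) - (1.0 - eta) * math.log(1.0 - eta_hat)


def regret_log(eta, eta_hat):
    """Conditional regret: binary KL divergence between eta and eta_hat."""
    return risk_log(eta, eta_hat) - risk_log(eta, eta)


# Interior grid of probabilities; endpoints 0 and 1 are excluded to avoid log(0).
grid = [i / 100.0 for i in range(1, 100)]

# Smallest value of regret - (lambda/2)(eta - eta_hat)^2 over the grid.
worst_gap = min(
    regret_log(eta, eta_hat) - 0.5 * LAMBDA * (eta - eta_hat) ** 2
    for eta in grid
    for eta_hat in grid
)
```

A non-negative `worst_gap` (up to rounding) is consistent with the inequality defining $\lambda$-strong properness.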

loss function    $\ell(y, f)$            $c(1, \hat\eta)$                  $c(-1, \hat\eta)$                 $\psi(\hat\eta)$                                  $\lambda$

squared-error    $(y - f)^2$             $4(1 - \hat\eta)^2$               $4\hat\eta^2$                     $2\hat\eta - 1$                                   8
logistic         $\log(1 + e^{-fy})$     $-\log \hat\eta$                  $-\log(1 - \hat\eta)$             $\log \frac{\hat\eta}{1 - \hat\eta}$              4
exponential      $e^{-yf}$               $\sqrt{\frac{1-\hat\eta}{\hat\eta}}$   $\sqrt{\frac{\hat\eta}{1-\hat\eta}}$   $\frac{1}{2} \log \frac{\hat\eta}{1 - \hat\eta}$   4

Table 2: Three popular strongly proper composite losses: squared-error, logistic and exponential losses.
Shown are the formula $\ell(y, f)$, the underlying CPE loss $c(y, \hat\eta)$ with the link function $\psi(\hat\eta)$, as well as the
strong properness constant $\lambda$. See (Agarwal, 2014) for more details and examples.
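
The relation $\ell(y, f) = c(y, \psi^{-1}(f))$ between a composite loss and its underlying CPE loss can be verified directly for the three losses of Table 2. The sketch below assumes the standard forms of the CPE losses and links (for the exponential loss, $c(1, \hat\eta) = \sqrt{(1-\hat\eta)/\hat\eta}$ and $\psi(\hat\eta) = \frac{1}{2}\log\frac{\hat\eta}{1-\hat\eta}$); the dictionary layout and names are illustrative only.

```python
import math

# Each entry: composite loss l(y, f), CPE loss c(y, p), and inverse link psi^{-1}(f).
LOSSES = {
    "squared-error": {
        "loss": lambda y, f: (y - f) ** 2,
        "c": lambda y, p: 4.0 * (1.0 - p) ** 2 if y == 1 else 4.0 * p ** 2,
        "inv_link": lambda f: (f + 1.0) / 2.0,  # inverse of psi(p) = 2p - 1
    },
    "logistic": {
        "loss": lambda y, f: math.log(1.0 + math.exp(-y * f)),
        "c": lambda y, p: -math.log(p) if y == 1 else -math.log(1.0 - p),
        "inv_link": lambda f: 1.0 / (1.0 + math.exp(-f)),  # inverse of the logit
    },
    "exponential": {
        "loss": lambda y, f: math.exp(-y * f),
        "c": lambda y, p: math.sqrt((1.0 - p) / p) if y == 1 else math.sqrt(p / (1.0 - p)),
        "inv_link": lambda f: 1.0 / (1.0 + math.exp(-2.0 * f)),  # inverse of (1/2) logit
    },
}

# Scores kept inside (-1, 1) so that the squared-error inverse link stays in (0, 1).
scores = [-0.9, -0.5, 0.0, 0.3, 0.9]

# Largest discrepancy between l(y, f) and c(y, psi^{-1}(f)) over all losses, labels, scores.
max_err = max(
    abs(spec["loss"](y, f) - spec["c"](y, spec["inv_link"](f)))
    for spec in LOSSES.values()
    for y in (-1, 1)
    for f in scores
)
```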

3 Main result 

Given a real-valued function $f\colon X \to \mathbb{R}$, and a $\lambda$-strongly proper composite loss $\ell(y, f)$, define the $\ell$-risk of
$f$ as the expected loss of $f(x)$ with respect to the data distribution:

$$\mathrm{Risk}_\ell(f) = \mathbb{E}_{(x,y)}\left[\ell(y, f(x))\right] = \mathbb{E}_x\left[\mathrm{risk}_\ell(\eta(x), f(x))\right],$$

where $\eta(x) = \Pr(y = 1 | x)$. Let $f^*_\ell$ be the minimizer of $\mathrm{Risk}_\ell(f)$ over all functions, $f^*_\ell = \arg\min_f \mathrm{Risk}_\ell(f)$.
Since $\ell$ is proper composite:

$$f^*_\ell(x) = \psi(\eta(x)).$$

Define the $\ell$-regret of $f$ as:

$$\mathrm{Reg}_\ell(f) = \mathrm{Risk}_\ell(f) - \mathrm{Risk}_\ell(f^*_\ell) = \mathbb{E}_x\left[\mathrm{risk}_\ell(\eta(x), f(x)) - \mathrm{risk}_\ell(\eta(x), f^*_\ell(x))\right].$$

Any real-valued function $f\colon X \to \mathbb{R}$ can be turned into a classifier $h_{f,\theta}\colon X \to \{-1, 1\}$, by thresholding
at some value $\theta$:

$$h_{f,\theta}(x) = \mathrm{sgn}(f(x) - \theta).$$

The purpose of this paper is to address the following problem: given a function $f$ with $\ell$-regret $\mathrm{Reg}_\ell(f)$, and
a threshold $\theta$, what can we say about the $\Psi$-regret of $h_{f,\theta}$? For instance, can we bound $\mathrm{Reg}_\Psi(h_{f,\theta})$ in terms of
$\mathrm{Reg}_\ell(f)$? We give a positive answer to this question, which is based on the following regret bound:

Lemma 1. Let $\Psi(\mathrm{FP}, \mathrm{FN})$ be a linear-fractional function of the form (1), which is non-increasing in FP
and FN. Assume that there exists $\gamma > 0$, such that for any classifier $h\colon X \to \{-1, 1\}$:

$$b_0 + b_1 \mathrm{FP}(h) + b_2 \mathrm{FN}(h) \ge \gamma,$$

i.e. the denominator of $\Psi$ is positive and bounded away from zero. Let $\ell$ be a $\lambda$-strongly proper composite
loss function. Then, there exists a threshold $\theta^*$, such that for any real-valued function $f\colon X \to \mathbb{R}$,

$$\mathrm{Reg}_\Psi(h_{f,\theta^*}) \le C \sqrt{\frac{2}{\lambda}} \sqrt{\mathrm{Reg}_\ell(f)},$$

where $C = \frac{1}{\gamma}\left(\Psi(h^*_\Psi)(b_1 + b_2) - (a_1 + a_2)\right) > 0$.

metric               $\gamma$                  $C$

Accuracy             $1$                       $2$
$F_\beta$-measure    $\beta^2 P$               $\frac{1 + \beta^2}{\beta^2 P}$
Jaccard similarity   $P$                       $\frac{J^* + 1}{P}$
AM measure           $2P(1 - P)$               $\frac{1}{2P(1 - P)}$
Weighted accuracy    $w_1 P + w_2 (1 - P)$     $\frac{w_1 + w_2}{w_1 P + w_2 (1 - P)}$

Table 3: Constants which appear in the bound of Lemma 1 for several performance metrics.


The proof is quite long and hence is postponed to Section 4. Interestingly, the proof goes by an
intermediate bound of the $\Psi$-regret by a cost-sensitive classification regret. We note that the bound in Lemma 1 is
in general unimprovable, in the sense that it is easy to find $f$, $\ell$, and distribution $\Pr(x, y)$, for which the
bound holds with equality (see proof for details). We split the constant in front of the bound into $C$ and $\lambda$,
because $C$ depends only on $\Psi$, while $\lambda$ depends only on $\ell$. Table 3 lists these constants for some popular
metrics. We note that constant $\gamma$ (lower bound on the denominator of $\Psi$) will be distribution-dependent in
general (as it can depend on $P = \Pr(y = 1)$) and may not have a uniform lower bound which holds for all
distributions.

Lemma 1 has the following interpretation. If we are able to find a function $f$ with small $\ell$-regret, we are
guaranteed that there exists a threshold $\theta^*$ such that $h_{f,\theta^*}$ has small $\Psi$-regret. Note that the same threshold
$\theta^*$ will work for any $f$, and the right hand side of the bound is independent of $\theta^*$. Hence, to minimize the
right hand side we only need to minimize the $\ell$-regret, and we can deal with the threshold afterwards.

Lemma 1 also reveals the form of the optimal classifier $h^*_\Psi$: take $f = f^*_\ell$ in the lemma and note that
$\mathrm{Reg}_\ell(f^*_\ell) = 0$, so that $\mathrm{Reg}_\Psi(h_{f^*_\ell,\theta^*}) = 0$, which means that $h_{f^*_\ell,\theta^*}$ is the maximizer of $\Psi$:

$$h^*_\Psi(x) = \mathrm{sgn}(f^*_\ell(x) - \theta^*) = \mathrm{sgn}(\eta(x) - \psi^{-1}(\theta^*)),$$

where the second equality is due to $f^*_\ell = \psi(\eta)$ and strict monotonicity of $\psi$. Hence, $h^*_\Psi$ is a threshold
function on $\eta$. The proof of Lemma 1 (see Section 4) actually specifies the exact value of the threshold $\theta^*$:

$$\psi^{-1}(\theta^*) = \frac{\Psi(h^*_\Psi) b_1 - a_1}{\Psi(h^*_\Psi)(b_1 + b_2) - (a_1 + a_2)},$$    (3)

which is in agreement with the result obtained by Koyejo et al (2014).^2

To make Lemma 1 easier to grasp, consider a special case when the performance metric $\Psi(\mathrm{FP}, \mathrm{FN}) =
1 - \mathrm{FP} - \mathrm{FN}$ is the classification accuracy. In this case, (3) gives $\psi^{-1}(\theta^*) = 1/2$. Hence, we obtain the
well-known result that the classifier maximizing the accuracy is a threshold function on $\eta$ at 1/2. Then, Lemma 1
states that given a real-valued $f$, we should take a classifier which thresholds $f$ at $\theta^* = \psi(1/2)$. Using
Table 2, one can easily verify that $\theta^* = 0$ for logistic, squared-error and exponential loss. This agrees with
the common approach of thresholding the real-valued classifiers trained by minimizing these losses at 0 to
obtain the label prediction. The bounds from the lemma are in this case identical (up to a multiplicative
constant) to the bounds obtained by Bartlett et al (2006).
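
The threshold formula (3) is easy to evaluate once the coefficients of the metric and the optimal value $\Psi(h^*_\Psi)$ are given. The following sketch uses a hypothetical helper `optimal_alpha` and the link functions of the three losses discussed above; the values passed as `psi_star` are placeholders, since the optimal metric value is unknown in practice.

```python
import math


def optimal_alpha(psi_star, a, b):
    """Evaluate psi^{-1}(theta*) from equation (3); a = (a0, a1, a2), b = (b0, b1, b2)."""
    _, a1, a2 = a
    _, b1, b2 = b
    return (psi_star * b1 - a1) / (psi_star * (b1 + b2) - (a1 + a2))


# Link functions psi of the squared-error, logistic and exponential losses.
LINKS = {
    "squared-error": lambda p: 2.0 * p - 1.0,
    "logistic": lambda p: math.log(p / (1.0 - p)),
    "exponential": lambda p: 0.5 * math.log(p / (1.0 - p)),
}

# Accuracy: Psi = 1 - FP - FN. Here b1 = b2 = 0, so the (placeholder) psi_star drops out.
alpha_acc = optimal_alpha(psi_star=0.9, a=(1.0, -1.0, -1.0), b=(1.0, 0.0, 0.0))
thresholds_acc = {name: link(alpha_acc) for name, link in LINKS.items()}

# F1-measure with P = 0.3 and a placeholder optimal value 0.6: the probability
# threshold reduces to psi_star / 2.
P = 0.3
alpha_f1 = optimal_alpha(psi_star=0.6, a=(2.0 * P, 0.0, -2.0), b=(2.0 * P, 1.0, -1.0))
```

For accuracy the computation recovers the probability threshold 1/2 and $\theta^* = 0$ for all three links; for the $F_1$-measure the threshold depends on the unknown optimum, which is why it has to be tuned.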

Unfortunately, for more complicated performance metrics, the optimal threshold $\theta^*$ is unknown, as (3)
contains the unknown quantity $\Psi(h^*_\Psi)$, the value of the metric at optimum. The solution in this case is to, given
$f$, directly search for a threshold which maximizes $\Psi(h_{f,\theta})$. This is the main result of the paper:

^2 To prove (3), Koyejo et al (2014) require an absolute continuity assumption on the marginal distribution over instances
with respect to some dominating measure, so as to guarantee the existence of an appropriate density. Our analysis shows that
the existence of a density is not required.

Theorem 2. Given a real-valued function $f$, let $\theta^*_f = \arg\max_\theta \Psi(h_{f,\theta})$. Then, under the assumptions and
notation from Lemma 1:

$$\mathrm{Reg}_\Psi(h_{f,\theta^*_f}) \le C \sqrt{\frac{2}{\lambda}} \sqrt{\mathrm{Reg}_\ell(f)}.$$

Proof. The result follows immediately from Lemma 1: solving $\max_\theta \Psi(h_{f,\theta})$ is equivalent to solving $\min_\theta \mathrm{Reg}_\Psi(h_{f,\theta})$,
and $\min_\theta \mathrm{Reg}_\Psi(h_{f,\theta}) \le \mathrm{Reg}_\Psi(h_{f,\theta^*})$, where $\theta^*$ is the threshold given by Lemma 1. □

Theorem 2 motivates the following procedure for maximization of $\Psi$:

1. Find $f$ with small $\ell$-regret, e.g. by using a learning algorithm minimizing $\ell$-risk on the training sample.

2. Given $f$, solve $\theta^*_f = \arg\max_\theta \Psi(h_{f,\theta})$.

Theorem 2 states that the $\Psi$-regret of the classifier obtained by this procedure is upperbounded by the
$\ell$-regret of the underlying real-valued function.

We now discuss how to approach step 2 of the procedure in practice. In principle, this step requires
maximizing $\Psi$ defined through FP and FN, which are expectations over an unknown distribution $\Pr(x, y)$.
However, it is sufficient to optimize $\theta$ on the empirical counterpart of $\Psi$ calculated on a separate validation
sample. Let $T = \{(x_i, y_i)\}_{i=1}^n$ be the validation set of size $n$. Define:

$$\widehat{\mathrm{FP}}(h) = \frac{1}{n} \sum_{i=1}^n [\![h(x_i) = 1, y_i = -1]\!], \qquad \widehat{\mathrm{FN}}(h) = \frac{1}{n} \sum_{i=1}^n [\![h(x_i) = -1, y_i = 1]\!],$$

the empirical counterparts of FP and FN, and let $\hat\Psi(h) = \Psi(\widehat{\mathrm{FP}}(h), \widehat{\mathrm{FN}}(h))$ be the empirical counterpart of
the performance metric $\Psi$. We now replace step 2 by:

Given $f$ and validation sample $T$, solve $\hat\theta_f = \arg\max_\theta \hat\Psi(h_{f,\theta})$.
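
The empirical threshold search can be carried out with a single sweep over the sorted validation scores. The sketch below is one possible implementation under stated assumptions, not necessarily the one used in the experiments: the function names are hypothetical, the classifier predicts +1 when the score is at least the threshold (a tie-breaking convention chosen here), the validation sample is assumed to contain at least one positive example, and the $F_1$-measure is used as the example metric.

```python
def f1_metric(fp, fn, p):
    """F1 in the (FP, FN) parameterization: 2(P - FN) / (2P - FN + FP)."""
    return 2.0 * (p - fn) / (2.0 * p - fn + fp)


def tune_threshold(scores, labels, metric):
    """Return (theta_hat, value) maximizing the empirical metric over thresholds.

    The classifier predicts +1 iff score >= theta (assumed tie convention).
    """
    n = len(scores)
    n_pos = sum(1 for y in labels if y == 1)
    p_hat = n_pos / n
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)

    # Start above all scores: everything is predicted negative.
    fp, fn = 0, n_pos
    best_theta = float("inf")
    best_value = metric(fp / n, fn / n, p_hat)

    i = 0
    while i < n:
        theta = scores[order[i]]
        # Move all examples with this score (ties) to the positive side at once.
        while i < n and scores[order[i]] == theta:
            if labels[order[i]] == 1:
                fn -= 1
            else:
                fp += 1
            i += 1
        value = metric(fp / n, fn / n, p_hat)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value


# Hypothetical validation scores f(x_i) and labels y_i.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, -1, 1, -1, -1]
theta_hat, f1_hat = tune_threshold(scores, labels, f1_metric)
```

Only the n distinct score values (plus the "predict all negative" case) need to be examined, because the empirical FP and FN change only when the threshold crosses a score; the sweep costs O(n log n) due to sorting.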

In Theorem 3 below, we show that:

$$\mathrm{Reg}_\Psi(h_{f,\hat\theta_f}) - \mathrm{Reg}_\Psi(h_{f,\theta^*_f}) = O\left(\sqrt{\frac{\log n}{n}}\right),$$

so that tuning the threshold on the validation sample of size $n$ (which results in $\hat\theta_f$) instead of on the
population level (which results in $\theta^*_f$) will cost at most $O\left(\sqrt{\log n / n}\right)$ additional regret. The main idea of
the proof is that finding the optimal threshold comes down to optimizing within a class of $\{-1, 1\}$-valued
threshold functions, which has small Vapnik-Chervonenkis dimension. This, together with the fact that
under assumptions from Lemma 1, $\Psi$ is stable with respect to its arguments, implies that $\Psi(h_{f,\hat\theta_f})$ is close
to $\Psi(h_{f,\theta^*_f})$.

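Theorem 3. Let the assumptions from Lemma 1 hold, and let:

$$D_1 = \sup_{(\mathrm{FP},\mathrm{FN})} |b_1 \Psi(\mathrm{FP}, \mathrm{FN}) - a_1|, \qquad D_2 = \sup_{(\mathrm{FP},\mathrm{FN})} |b_2 \Psi(\mathrm{FP}, \mathrm{FN}) - a_2|,$$

and $D = \max\{D_1, D_2\}$. Given a real-valued function $f$, and a validation set $T$ of size $n$ generated i.i.d.
from $\Pr(x, y)$, let $\hat\theta_f = \arg\max_\theta \hat\Psi(h_{f,\theta})$ be the threshold maximizing the empirical counterpart of $\Psi$ evaluated
on $T$. Then, with probability $1 - \delta$ (over the random choice of $T$):

$$\mathrm{Reg}_\Psi(h_{f,\hat\theta_f}) \le C \sqrt{\frac{2}{\lambda}} \sqrt{\mathrm{Reg}_\ell(f)} + \frac{16 D}{\gamma} \sqrt{\frac{4(1 + \log n) + 2 \log \frac{16}{\delta}}{n}}.$$
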
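Proof. For any FP and FN, we have:

$$\left|\frac{\partial \Psi(\mathrm{FP}, \mathrm{FN})}{\partial \mathrm{FP}}\right| = \frac{|a_1(b_0 + b_1 \mathrm{FP} + b_2 \mathrm{FN}) - b_1(a_0 + a_1 \mathrm{FP} + a_2 \mathrm{FN})|}{(b_0 + b_1 \mathrm{FP} + b_2 \mathrm{FN})^2} = \frac{|b_1 \Psi(\mathrm{FP}, \mathrm{FN}) - a_1|}{b_0 + b_1 \mathrm{FP} + b_2 \mathrm{FN}} \le \frac{|b_1 \Psi(\mathrm{FP}, \mathrm{FN}) - a_1|}{\gamma} \le \frac{D}{\gamma},$$

and similarly,

$$\left|\frac{\partial \Psi(\mathrm{FP}, \mathrm{FN})}{\partial \mathrm{FN}}\right| = \frac{|b_2 \Psi(\mathrm{FP}, \mathrm{FN}) - a_2|}{b_0 + b_1 \mathrm{FP} + b_2 \mathrm{FN}} \le \frac{D}{\gamma}.$$

For any (FP, FN) and (FP', FN'), Taylor-expanding $\Psi(\mathrm{FP}, \mathrm{FN})$ around (FP', FN') up to the first order and
using the bounds above gives:

$$\Psi(\mathrm{FP}, \mathrm{FN}) \le \Psi(\mathrm{FP}', \mathrm{FN}') + \frac{D}{\gamma} \left(|\mathrm{FP} - \mathrm{FP}'| + |\mathrm{FN} - \mathrm{FN}'|\right).$$    (4)

Now, we have:

$$\mathrm{Reg}_\Psi(h_{f,\hat\theta_f}) = \mathrm{Reg}_\Psi(h_{f,\theta^*_f}) + \Psi(h_{f,\theta^*_f}) - \Psi(h_{f,\hat\theta_f}) \le C \sqrt{\frac{2}{\lambda}} \sqrt{\mathrm{Reg}_\ell(f)} + \Psi(h_{f,\theta^*_f}) - \Psi(h_{f,\hat\theta_f}),$$

where we used Theorem 2. Thus, it remains to bound $\Psi(h_{f,\theta^*_f}) - \Psi(h_{f,\hat\theta_f})$. From the definition of $\hat\theta_f$,
$\hat\Psi(h_{f,\hat\theta_f}) \ge \hat\Psi(h_{f,\theta^*_f})$. Hence:

$$\Psi(h_{f,\theta^*_f}) - \Psi(h_{f,\hat\theta_f}) \le \Psi(h_{f,\theta^*_f}) - \hat\Psi(h_{f,\theta^*_f}) + \hat\Psi(h_{f,\hat\theta_f}) - \Psi(h_{f,\hat\theta_f})$$
$$\le 2 \sup_\theta \left|\Psi(h_{f,\theta}) - \hat\Psi(h_{f,\theta})\right|$$
$$= 2 \sup_\theta \left|\Psi(\mathrm{FP}(h_{f,\theta}), \mathrm{FN}(h_{f,\theta})) - \Psi(\widehat{\mathrm{FP}}(h_{f,\theta}), \widehat{\mathrm{FN}}(h_{f,\theta}))\right|,$$

where we used the definition of $\hat\Psi$. Using (4),

$$\Psi(h_{f,\theta^*_f}) - \Psi(h_{f,\hat\theta_f}) \le \frac{2D}{\gamma} \left(\sup_\theta \left|\mathrm{FP}(h_{f,\theta}) - \widehat{\mathrm{FP}}(h_{f,\theta})\right| + \sup_\theta \left|\mathrm{FN}(h_{f,\theta}) - \widehat{\mathrm{FN}}(h_{f,\theta})\right|\right).$$

Note that the suprema above are on the deviation of the empirical mean from the expectation over the class of
threshold functions, which has Vapnik-Chervonenkis dimension equal to 2. Using a standard argument from
Vapnik-Chervonenkis theory (see, e.g., Devroye et al, 1996), with probability $1 - \frac{\delta}{2}$ over the random choice
of $T$:

$$\sup_\theta \left|\mathrm{FP}(h_{f,\theta}) - \widehat{\mathrm{FP}}(h_{f,\theta})\right| \le 4 \sqrt{\frac{4(1 + \log n) + 2 \log \frac{16}{\delta}}{n}},$$

and similarly for the second supremum. Thus, with probability $1 - \delta$,

$$\Psi(h_{f,\theta^*_f}) - \Psi(h_{f,\hat\theta_f}) \le \frac{16 D}{\gamma} \sqrt{\frac{4(1 + \log n) + 2 \log \frac{16}{\delta}}{n}},$$

which finishes the proof. □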

We note that, contrary to similar results by Koyejo et al (2014), Theorem 3 does not require continuity
of the cumulative distribution of $\eta(x)$ around $\theta^*$.
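
To get a feeling for the size of the validation term in Theorem 3, the sketch below evaluates it for a given sample size. It assumes the term reads $\frac{16D}{\gamma}\sqrt{(4(1+\log n) + 2\log(16/\delta))/n}$, which is how the bound is understood here; the function name and the example constants ($D = 1$, $\gamma = 1$, as for classification accuracy) are illustrative.

```python
import math


def validation_term(n, delta, D, gamma):
    """Second term of the Theorem 3 bound, as read above (an assumption of this sketch)."""
    inner = (4.0 * (1.0 + math.log(n)) + 2.0 * math.log(16.0 / delta)) / n
    return (16.0 * D / gamma) * math.sqrt(inner)


# Accuracy: b1 = b2 = 0 and a1 = a2 = -1, hence D = 1 and gamma = 1.
sizes = [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]
terms = [validation_term(n, delta=0.05, D=1.0, gamma=1.0) for n in sizes]
```

The term decays at the rate $\sqrt{\log n / n}$, but the constants are conservative, so the bound is informative mainly for large validation samples.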

4 Proof of Lemma 1

The proof can be skipped without affecting the flow of later sections. The proof consists of two steps. First,
we bound the $\Psi$-regret of any classifier $h$ by its cost-sensitive classification regret (introduced below). Next,
we show that there exists a threshold $\theta^*$, such that for any $f$, the cost-sensitive classification regret of $h_{f,\theta^*}$
is upperbounded by the $\ell$-regret of $f$. These two steps will be formalized as Proposition 4 and Proposition 5.

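Given a real number $\alpha \in [0, 1]$, define a cost-sensitive classification loss $\ell_\alpha\colon \{-1, 1\} \times \{-1, 1\} \to \mathbb{R}_+$ as:

$$\ell_\alpha(y, \hat y) = \alpha [\![y = -1]\!] [\![\hat y = 1]\!] + (1 - \alpha) [\![y = 1]\!] [\![\hat y = -1]\!].$$

The cost-sensitive loss assigns different costs of misclassification for positive and negative labels. Given
classifier $h$, the cost-sensitive risk of $h$ is:

$$\mathrm{Risk}_\alpha(h) = \mathbb{E}_{(x,y)}\left[\ell_\alpha(y, h(x))\right] = \alpha \mathrm{FP}(h) + (1 - \alpha) \mathrm{FN}(h),$$

and the cost-sensitive regret is:

$$\mathrm{Reg}_\alpha(h) = \mathrm{Risk}_\alpha(h) - \mathrm{Risk}_\alpha(h^*_\alpha),$$

where $h^*_\alpha = \arg\min_h \mathrm{Risk}_\alpha(h)$. We now show the following two results:

Proposition 4. Let $\Psi$ satisfy the assumptions from Lemma 1. Define:

$$\alpha = \frac{\Psi^* b_1 - a_1}{\Psi^*(b_1 + b_2) - (a_1 + a_2)},$$    (5)

where $\Psi^* = \Psi(h^*_\Psi)$. Then, $\alpha \in [0, 1]$ and for any classifier $h$,

$$\mathrm{Reg}_\Psi(h) \le C\, \mathrm{Reg}_\alpha(h),$$

where $C$ is defined as in Lemma 1.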

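Proof. The proof generalizes the proof of Proposition 6 from Parambath et al (2014), which concerned the
special case of the $F_\beta$-measure. For the sake of clarity, we use a shorthand notation $\Psi = \Psi(h)$, $\Psi^* = \Psi(h^*_\Psi)$,
$\mathrm{FP} = \mathrm{FP}(h)$, $\mathrm{FN} = \mathrm{FN}(h)$, $A = a_0 + a_1 \mathrm{FP} + a_2 \mathrm{FN}$, $B = b_0 + b_1 \mathrm{FP} + b_2 \mathrm{FN}$ for the numerator and denominator
of $\Psi(h)$, and analogously $\mathrm{FP}^*$, $\mathrm{FN}^*$, $A^*$ and $B^*$ for $\Psi(h^*_\Psi)$. In this notation:

$$\mathrm{Reg}_\Psi(h) = \Psi^* - \Psi = \frac{\Psi^* B - A}{B}$$
$$= \frac{\Psi^* B - A - (\Psi^* B^* - A^*)}{B} \qquad \text{(since } \Psi^* B^* - A^* = 0\text{)}$$
$$= \frac{\Psi^* (B - B^*) - (A - A^*)}{B}$$
$$= \frac{(\Psi^* b_1 - a_1)(\mathrm{FP} - \mathrm{FP}^*) + (\Psi^* b_2 - a_2)(\mathrm{FN} - \mathrm{FN}^*)}{B}$$
$$\le \frac{(\Psi^* b_1 - a_1)(\mathrm{FP} - \mathrm{FP}^*) + (\Psi^* b_2 - a_2)(\mathrm{FN} - \mathrm{FN}^*)}{\gamma},$$    (6)

where the last inequality follows from $B \ge \gamma$ (assumption) and the fact that $\mathrm{Reg}_\Psi(h) \ge 0$ for any $h$. Since
$\Psi$ is non-increasing in FP and FN, we have

$$\frac{\partial \Psi^*}{\partial \mathrm{FP}^*} = \frac{a_1 B^* - b_1 A^*}{(B^*)^2} = \frac{a_1 - b_1 \Psi^*}{B^*} \le 0,$$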
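and similarly $\frac{\partial \Psi^*}{\partial \mathrm{FN}^*} = \frac{a_2 - b_2 \Psi^*}{B^*} \le 0$. This and the assumption $B^* \ge \gamma$ implies that both $\Psi^* b_1 - a_1$ and
$\Psi^* b_2 - a_2$ are non-negative, so can be interpreted as misclassification costs. If we normalize the costs by
defining:

$$\alpha = \frac{\Psi^* b_1 - a_1}{\Psi^*(b_1 + b_2) - (a_1 + a_2)},$$

then (6) implies:

$$\mathrm{Reg}_\Psi(h) \le C \left(\mathrm{Risk}_\alpha(h) - \mathrm{Risk}_\alpha(h^*_\Psi)\right) \le C \left(\mathrm{Risk}_\alpha(h) - \mathrm{Risk}_\alpha(h^*_\alpha)\right) = C\, \mathrm{Reg}_\alpha(h).$$

□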
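
Proposition 4 can be checked by brute force on a small finite input space, where all classifiers can be enumerated. The sketch below does this for the $F_1$-measure; the toy distribution (`px`, `eta`) is hypothetical, and the constant $C$ is taken to include the factor $1/\gamma$ with $\gamma = \beta^2 P = P$, as the entries of Table 3 suggest.

```python
from itertools import product

# Hypothetical toy distribution: marginal Pr(x) and eta(x) = Pr(y = 1 | x) on four inputs.
px = [0.4, 0.3, 0.2, 0.1]
eta = [0.05, 0.3, 0.6, 0.9]
P = sum(p * e for p, e in zip(px, eta))

# Coefficients of the F1-measure in the form (1).
a = (2.0 * P, 0.0, -2.0)
b = (2.0 * P, 1.0, -1.0)


def rates(h):
    """Population FP and FN of a classifier h given as a tuple of -1/+1 predictions."""
    fp = sum(p * (1.0 - e) for p, e, hi in zip(px, eta, h) if hi == 1)
    fn = sum(p * e for p, e, hi in zip(px, eta, h) if hi == -1)
    return fp, fn


def psi(h):
    fp, fn = rates(h)
    return (a[0] + a[1] * fp + a[2] * fn) / (b[0] + b[1] * fp + b[2] * fn)


classifiers = list(product([-1, 1], repeat=len(px)))
psi_star = max(psi(h) for h in classifiers)

gamma = P  # lower bound beta^2 * P on the denominator, with beta = 1
C = (psi_star * (b[1] + b[2]) - (a[1] + a[2])) / gamma
alpha = (psi_star * b[1] - a[1]) / (psi_star * (b[1] + b[2]) - (a[1] + a[2]))


def cs_risk(h):
    """Cost-sensitive risk alpha * FP + (1 - alpha) * FN."""
    fp, fn = rates(h)
    return alpha * fp + (1.0 - alpha) * fn


best_cs = min(cs_risk(h) for h in classifiers)

# Smallest value of C * Reg_alpha(h) - Reg_Psi(h) over all 16 classifiers.
slack = min(C * (cs_risk(h) - best_cs) - (psi_star - psi(h)) for h in classifiers)
```

A non-negative `slack` (up to rounding) means the inequality $\mathrm{Reg}_\Psi(h) \le C\,\mathrm{Reg}_\alpha(h)$ holds for every classifier on this toy problem; for $F_1$ the cost reduces to $\alpha = \Psi^*/2$.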

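Proposition 5. For any real-valued function $f\colon X \to \mathbb{R}$, any $\lambda$-strongly proper composite loss $\ell$ with link
function $\psi$, and any $\alpha \in [0, 1]$:

$$\mathrm{Reg}_\alpha(h_{f,\theta^*}) \le \sqrt{\frac{2}{\lambda}} \sqrt{\mathrm{Reg}_\ell(f)},$$    (7)

where $\theta^* = \psi(\alpha)$.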

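Proof. First, we will show that (7) holds conditionally for every $x$. To this end, we fix $x$ and deal with
$h(x) \in \{-1, 1\}$, $f(x) \in \mathbb{R}$ and $\eta(x) \in [0, 1]$, using a shorthand notation $h$, $f$, $\eta$.

Given $\eta \in [0, 1]$ and $h \in \{-1, 1\}$, define the conditional cost-sensitive risk as:

$$\mathrm{risk}_\alpha(\eta, h) = \alpha (1 - \eta) [\![h = 1]\!] + (1 - \alpha) \eta [\![h = -1]\!].$$

Let $h^*_\alpha = \arg\min_h \mathrm{risk}_\alpha(\eta, h)$. It can be easily verified that:

$$h^*_\alpha = \mathrm{sgn}(\eta - \alpha).$$    (8)

Define the conditional cost-sensitive regret as

$$\mathrm{reg}_\alpha(\eta, h) = \mathrm{risk}_\alpha(\eta, h) - \mathrm{risk}_\alpha(\eta, h^*_\alpha).$$

Note that if $h = h^*_\alpha$, then $\mathrm{reg}_\alpha(\eta, h) = 0$. Otherwise, $\mathrm{reg}_\alpha(\eta, h) = |\eta - \alpha|$, so that:

$$\mathrm{reg}_\alpha(\eta, h) = [\![h \ne h^*_\alpha]\!]\, |\eta - \alpha|.$$

Now assume $h = \mathrm{sgn}(\hat\eta - \alpha)$ for some $\hat\eta$, i.e., $h$ is of the same form as $h^*_\alpha$ in (8), with $\eta$ replaced by $\hat\eta$. We
show that for such $h$,

$$\mathrm{reg}_\alpha(\eta, h) \le |\eta - \hat\eta|.$$    (9)

This statement trivially holds when $h = h^*_\alpha$. If $h \ne h^*_\alpha$, then $\eta$ and $\hat\eta$ are on the opposite sides of $\alpha$ (i.e.
either $\eta \ge \alpha$ and $\hat\eta \le \alpha$, or $\eta \le \alpha$ and $\hat\eta \ge \alpha$), hence $|\eta - \alpha| \le |\eta - \hat\eta|$, which proves (9).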

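Now, we set the threshold to $\theta^* = \psi(\alpha)$, so that given $f \in \mathbb{R}$,

$$h_{f,\theta^*} = \mathrm{sgn}(f - \theta^*) = \mathrm{sgn}(f - \psi(\alpha)) = \mathrm{sgn}(\psi^{-1}(f) - \alpha),$$

due to strict monotonicity of $\psi$. Using (9) with $h = h_{f,\theta^*}$ and $\hat\eta = \psi^{-1}(f)$ gives:

$$\mathrm{reg}_\alpha(\eta, h_{f,\theta^*}) \le |\eta - \psi^{-1}(f)| = \sqrt{(\eta - \psi^{-1}(f))^2} \le \sqrt{\frac{2}{\lambda}} \sqrt{\mathrm{reg}_\ell(\eta, f)},$$    (10)

and the last inequality follows from strong properness (2).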
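
The conditional bound (10) can be checked numerically for a concrete loss. The sketch below does so for the logistic loss ($\lambda = 4$, logit link) on a grid of values of $\eta$, $\alpha$ and $f$. The grid, the helper names, and the tie convention (predict +1 when $f \ge \theta^*$ or $\eta \ge \alpha$) are assumptions of this illustration.

```python
import math

LAMBDA = 4.0  # strong properness constant of the logistic loss


def logit(p):
    return math.log(p / (1.0 - p))


def reg_logistic(eta, f):
    """Conditional logistic regret reg_l(eta, f) = risk_l(eta, f) - risk_l(eta, psi(eta))."""
    def risk(g):
        return eta * math.log(1.0 + math.exp(-g)) + (1.0 - eta) * math.log(1.0 + math.exp(g))
    return risk(f) - risk(logit(eta))


def reg_cost_sensitive(eta, f, alpha):
    """Conditional cost-sensitive regret of h = sgn(f - psi(alpha))."""
    h = 1 if f >= logit(alpha) else -1
    h_star = 1 if eta >= alpha else -1
    return abs(eta - alpha) if h != h_star else 0.0


probs = [i / 20.0 for i in range(1, 20)]        # 0.05, 0.10, ..., 0.95
scores = [-3.0 + 0.25 * k for k in range(25)]   # -3.0, -2.75, ..., 3.0

# Smallest value of sqrt(2/lambda) * sqrt(reg_l) - reg_alpha over the grid.
min_slack = min(
    math.sqrt(2.0 / LAMBDA) * math.sqrt(max(reg_logistic(eta, f), 0.0))
    - reg_cost_sensitive(eta, f, alpha)
    for eta in probs
    for alpha in probs
    for f in scores
)
```

The same threshold rule $\theta^* = \psi(\alpha)$ is used for every cost $\alpha$, which mirrors the observation that the costs only affect the threshold and not the function $f$ or the loss.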


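To prove the unconditional statement (7), we take expectation with respect to $x$ on both sides of (10):

$$\mathrm{Reg}_\alpha(h_{f,\theta^*}) = \mathbb{E}_x\left[\mathrm{reg}_\alpha(\eta(x), h_{f,\theta^*}(x))\right]$$
$$\le \sqrt{\frac{2}{\lambda}}\, \mathbb{E}_x\left[\sqrt{\mathrm{reg}_\ell(\eta(x), f(x))}\right] \qquad \text{(by (10))}$$
$$\le \sqrt{\frac{2}{\lambda}} \sqrt{\mathbb{E}_x\left[\mathrm{reg}_\ell(\eta(x), f(x))\right]}$$
$$= \sqrt{\frac{2}{\lambda}} \sqrt{\mathrm{Reg}_\ell(f)},$$    (11)

where the second inequality is from Jensen's inequality applied to the concave function $x \mapsto \sqrt{x}$.

We note that derivation of (9) follows the steps of the proof of Lemma 4 in Menon et al (2013), while
(10) and (11) were shown in the proof of Theorem 13 by Agarwal (2014). Hence, the proof is essentially a
combination of existing results, which are rederived here for the sake of completeness. □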

Proof of Lemma 1. Lemma 1 immediately follows from Proposition 4 and Proposition 5. □ 

Note that the proof actually specifies the exact value of the universal threshold, $\theta^* = \psi(\alpha)$, where $\alpha$ is given by (5).

The bound in Lemma 1 is unimprovable in the sense that there exist $\ell$, $\Psi$, $f$ and a distribution $\Pr(x, y)$ for which the bound is tight. To see this, take, for instance, squared error loss $\ell(y, f) = (y - f)^2$ and the classification accuracy metric $\Psi(\mathrm{FP}, \mathrm{FN}) = 1 - \mathrm{FP} - \mathrm{FN}$. The constants in Lemma 1 are equal to $\gamma = 1$, $C = 2$, and $\lambda = 8$ (see Table 1), while the optimal threshold is $\theta^* = 0$. The bound then simplifies to

$$\mathrm{Reg}_{0/1}(\mathrm{sgn}(f)) \le \sqrt{\mathrm{Reg}_{\mathrm{sqr}}(f)},$$

which is known to be tight (Bartlett et al, 2006).
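The tightness claim can be illustrated at a single point $x$ (a conditional version of the bound, which is all this sketch covers). For squared error loss with $y \in \{-1, 1\}$ the conditional risk minimizer is $2\eta - 1$ and the conditional regret is $(f - (2\eta - 1))^2$; the conditional 0/1 regret is $|2\eta - 1|$ whenever $\mathrm{sgn}(f)$ disagrees with $\mathrm{sgn}(2\eta - 1)$. The function names and the convention sgn(0) = 1 are assumptions made for illustration.

```python
def sgn(t):
    # Tie-breaking convention sgn(0) = 1 is an assumption of this sketch.
    return 1 if t >= 0 else -1

def reg_01(eta, h):
    # Conditional 0/1 regret: |2*eta - 1| if h disagrees with the Bayes
    # prediction sgn(2*eta - 1), and 0 otherwise.
    return abs(2 * eta - 1) if h != sgn(2 * eta - 1) else 0.0

def reg_sqr(eta, f):
    # Conditional regret of squared error loss (y - f)^2 with y in {-1, 1}:
    # the minimizer is 2*eta - 1 and the excess risk is (f - (2*eta - 1))^2.
    return (f - (2 * eta - 1)) ** 2

def gap(eta, f):
    # Slack in the conditional bound reg_01 <= sqrt(reg_sqr).
    return reg_sqr(eta, f) ** 0.5 - reg_01(eta, sgn(f))

# The slack shrinks as f approaches the threshold 0 from the wrong side.
for eps in (1e-1, 1e-3, 1e-6):
    print(eps, gap(0.9, -eps))
```

With $\eta = 0.9$ and $f = -\varepsilon$, both sides approach $0.8$ as $\varepsilon \to 0$, so the constant in the bound cannot be reduced for this loss and metric.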


5 Multilabel classification 

In multilabel classification (Dembczyński et al, 2012; Parambath et al, 2014; Koyejo et al, 2015), the goal is, given an input (feature vector) $x \in X$, to simultaneously predict the subset $L \subseteq \mathcal{L}$ of the set of $m$ labels $\mathcal{L} = \{\sigma_1, \ldots, \sigma_m\}$. The subset $L$ is often called the set of relevant (positive) labels, while the complement $\mathcal{L} \setminus L$ is considered as irrelevant (negative) for $x$. We identify a set $L$ of relevant labels with a vector $y = (y_1, y_2, \ldots, y_m)$, $y_i \in \{-1, 1\}$, in which $y_i = 1$ iff $\sigma_i \in L$. We assume observations $(x, y)$ are generated i.i.d. according to $\Pr(x, y)$ (note that the labels are not assumed to be independent). A multilabel classifier:

$$h(x) = (h_1(x), h_2(x), \ldots, h_m(x)),$$

is a mapping $h \colon X \to \{-1, 1\}^m$, which assigns a (predicted) label subset to each instance $x \in X$. For any $i = 1, \ldots, m$, the function $h_i(x)$ is thus a binary classifier, which can be evaluated by means of $\mathrm{TP}_i(h_i)$, $\mathrm{FP}_i(h_i)$, $\mathrm{TN}_i(h_i)$ and $\mathrm{FN}_i(h_i)$, which are true/false positives/negatives defined with respect to label $y_i$, e.g. $\mathrm{FP}_i(h_i) = \Pr(h_i(x) = 1 \wedge y_i = -1)$.

Let $f_1, \ldots, f_m$ be a set of real-valued functions $f_i \colon X \to \mathbb{R}$, $i = 1, \ldots, m$, and let $\ell$ be a $\lambda$-strongly proper composite loss for binary classification. For each $i = 1, \ldots, m$, we let $\mathrm{Risk}^i_\ell(f_i)$ and $\mathrm{Reg}^i_\ell(f_i)$ denote the $\ell$-risk and the $\ell$-regret of function $f_i$ with respect to label $y_i$:

$$\mathrm{Risk}^i_\ell(f_i) = \mathbb{E}_{(x, y_i)}\big[\ell(y_i, f_i(x))\big], \qquad \mathrm{Reg}^i_\ell(f_i) = \mathrm{Risk}^i_\ell(f_i) - \min_f \mathrm{Risk}^i_\ell(f).$$

Note that the problem has been decomposed into $m$ independent binary problems and the functions can
be obtained by training $m$ independent real-valued binary classifiers by minimizing loss $\ell$ on the training
sample, one for each of the $m$ labels.

What follows next depends on the way in which the binary classification performance metric is applied in
the multilabel setting. We consider two ways of turning a binary classification metric into a multilabel metric:
macro-averaging and micro-averaging (Manning et al, 2008; Parambath et al, 2014; Koyejo et al,
2015).

5.1 Macro-averaging 

Given a binary classification performance metric $\Psi(h) = \Psi(\mathrm{FP}(h), \mathrm{FN}(h))$, and a multilabel classifier $h$, we define the macro-averaged metric $\Psi_{\mathrm{macro}}(h)$ as:

$$\Psi_{\mathrm{macro}}(h) = \frac{1}{m} \sum_{i=1}^m \Psi(h_i) = \frac{1}{m} \sum_{i=1}^m \Psi(\mathrm{FP}_i(h_i), \mathrm{FN}_i(h_i)).$$

The macro-averaging is thus based on first computing the performance metric separately for each label, and then averaging the metrics over the labels. The $\Psi_{\mathrm{macro}}$-regret is then defined as:

$$\mathrm{Reg}_{\Psi_{\mathrm{macro}}}(h) = \Psi_{\mathrm{macro}}(h^*_\Psi) - \Psi_{\mathrm{macro}}(h) = \frac{1}{m} \sum_{i=1}^m \big(\Psi(h^*_{\Psi,i}) - \Psi(h_i)\big),$$

where $h^*_\Psi = (h^*_{\Psi,1}, \ldots, h^*_{\Psi,m})$ is the $\Psi$-optimal multilabel classifier:

$$h^*_{\Psi,i} = \operatorname*{argmax}_h \Psi(\mathrm{FP}_i(h), \mathrm{FN}_i(h)), \qquad i = 1, \ldots, m.$$

Since the regret decomposes into a weighted sum, it is straightforward to apply the previously derived bound to obtain a regret bound for the macro-averaged performance metric.

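Theorem 6. Let $\Psi(\mathrm{FP}, \mathrm{FN})$ and $\ell$ satisfy the assumptions of Lemma 1. For a set of $m$ real-valued functions $\{f_i \colon X \to \mathbb{R}\}_{i=1}^m$, let $\theta^*_{f_i} = \operatorname{argmax}_\theta \Psi(h_{f_i,\theta})$ for each $i = 1, \ldots, m$. Then the classifier $h$ defined as:

$$h = (h_{f_1,\theta^*_{f_1}}, h_{f_2,\theta^*_{f_2}}, \ldots, h_{f_m,\theta^*_{f_m}}),$$

achieves the following bound on its $\Psi_{\mathrm{macro}}$-regret:

$$\mathrm{Reg}_{\Psi_{\mathrm{macro}}}(h) \le \sqrt{\frac{2}{\lambda}}\,\frac{1}{m} \sum_{i=1}^m C_i \sqrt{\mathrm{Reg}^i_\ell(f_i)},$$

where $C_i = \frac{1}{\gamma}\big(\Psi(h^*_{\Psi,i})(b_1 + b_2) - (a_1 + a_2)\big)$, $i = 1, \ldots, m$.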

Proof. The theorem follows from applying Theorem 2 once for each label, and then averaging the bounds 
over the labels. □ 

Theorem 6 suggests a straightforward decomposition into $m$ independent binary classification problems,
one for each label $y_1, \ldots, y_m$, and running (independently for each problem) the two-step procedure described
in Section 3: for $i = 1, \ldots, m$, we learn a function $f_i$ with small $\ell$-regret with respect to label $y_i$, and tune
the threshold $\theta^*_{f_i}$ to optimize $\Psi(h_{f_i,\theta})$ (similarly as in the binary classification case, one can show that tuning
the threshold on a separate validation sample is sufficient). Due to the decomposition of $\Psi_{\mathrm{macro}}$ into the sum
over the labels, this simple procedure turns out to be sufficient. As we shall see, the case of micro-averaging
is more interesting.
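The per-label tuning step for macro-averaging can be sketched as follows, taking the F-measure as the target metric and writing it as a linear-fractional function of the FP and FN rates. This is a minimal illustration under stated assumptions: the scores are taken as given (the learning step is omitted), rates are empirical frequencies on a validation sample, and the helper names `rates`, `f_measure`, `tune_threshold` and `macro_f` are hypothetical.

```python
def rates(scores, labels, theta):
    # Empirical FP and FN rates of the classifier h = sgn(f - theta),
    # predicting +1 when the score is >= theta (tie convention assumed here).
    n = len(labels)
    fp = sum(1 for s, y in zip(scores, labels) if s >= theta and y == -1) / n
    fn = sum(1 for s, y in zip(scores, labels) if s < theta and y == 1) / n
    return fp, fn

def f_measure(fp, fn, p):
    # F1 = 2TP / (2TP + FP + FN) with TP = p - FN, p = rate of positives.
    denom = 2 * p + fp - fn
    return 2 * (p - fn) / denom if denom > 0 else 0.0

def tune_threshold(scores, labels):
    # Exhaustive search over the observed scores plus one value above all of them.
    p = sum(1 for y in labels if y == 1) / len(labels)
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    return max(candidates, key=lambda t: f_measure(*rates(scores, labels, t), p))

def macro_f(score_cols, label_cols, thetas):
    # Average of per-label F-measures, each label with its own threshold.
    vals = []
    for scores, labels, theta in zip(score_cols, label_cols, thetas):
        p = sum(1 for y in labels if y == 1) / len(labels)
        vals.append(f_measure(*rates(scores, labels, theta), p))
    return sum(vals) / len(vals)

# Toy validation data: two labels, four instances each.
score_cols = [[0.9, 0.8, 0.2, 0.1], [0.3, 0.6, 0.7, 0.4]]
label_cols = [[1, 1, -1, -1], [-1, 1, 1, -1]]
thetas = [tune_threshold(s, y) for s, y in zip(score_cols, label_cols)]
print(thetas, macro_f(score_cols, label_cols, thetas))
```

Each label gets its own threshold because the macro-averaged metric is a plain average of per-label metrics, so the labels can be optimized independently.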
5.2 Micro-averaging

Given a binary classification performance metric $\Psi(h) = \Psi(\mathrm{FP}(h), \mathrm{FN}(h))$, and a multilabel classifier $h$, we define the micro-averaged metric $\Psi_{\mathrm{micro}}(h)$ as:

$$\Psi_{\mathrm{micro}}(h) = \Psi(\overline{\mathrm{FP}}(h), \overline{\mathrm{FN}}(h)),$$

where:

$$\overline{\mathrm{FP}}(h) = \frac{1}{m} \sum_{i=1}^m \mathrm{FP}_i(h_i), \qquad \overline{\mathrm{FN}}(h) = \frac{1}{m} \sum_{i=1}^m \mathrm{FN}_i(h_i).$$

Thus, in micro-averaging, the false positives and false negatives are first averaged over the labels, and then the performance metric is calculated on these averaged quantities. The $\Psi_{\mathrm{micro}}$-regret:

$$\mathrm{Reg}_{\Psi_{\mathrm{micro}}}(h) = \Psi_{\mathrm{micro}}(h^*_\Psi) - \Psi_{\mathrm{micro}}(h), \qquad \text{where } h^*_\Psi = \operatorname*{argmax}_h \Psi_{\mathrm{micro}}(h),$$

does not decompose into the sum over labels anymore. However, we are still able to obtain a regret bound, reusing the techniques from Section 4, and, interestingly, this time only a single threshold needs to be tuned and is shared among all labels.

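Theorem 7. Let $\Psi(\mathrm{FP}, \mathrm{FN})$ and $\ell$ satisfy the assumptions of Lemma 1. For a set of $m$ real-valued functions $\{f_i \colon X \to \mathbb{R}\}_{i=1}^m$, let $\theta^*_f = \operatorname{argmax}_\theta \Psi_{\mathrm{micro}}(h_{f,\theta})$, where:

$$h_{f,\theta} = (h_{f_1,\theta}, \ldots, h_{f_m,\theta}).$$

Then, the classifier $h_{f,\theta^*_f} = (h_{f_1,\theta^*_f}, \ldots, h_{f_m,\theta^*_f})$ achieves the following bound on its $\Psi_{\mathrm{micro}}$-regret:

$$\mathrm{Reg}_{\Psi_{\mathrm{micro}}}(h_{f,\theta^*_f}) \le C \sqrt{\frac{2}{\lambda}}\,\frac{1}{m} \sum_{i=1}^m \sqrt{\mathrm{Reg}^i_\ell(f_i)},$$

where $C = \frac{1}{\gamma}\big(\Psi_{\mathrm{micro}}(h^*_\Psi)(b_1 + b_2) - (a_1 + a_2)\big)$.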

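Proof. The proof follows closely the proof of Lemma 1. In fact, only Proposition 4 requires modifications, which are given below. Take any real values $\mathrm{FP}, \mathrm{FN}$ and $\mathrm{FP}^*, \mathrm{FN}^*$ (to be specified later) in the domain of $\Psi$, such that:

$$\Psi(\mathrm{FP}^*, \mathrm{FN}^*) - \Psi(\mathrm{FP}, \mathrm{FN}) \ge 0. \qquad (12)$$

Using exactly the same steps as in the derivation (6), we obtain:

$$\Psi(\mathrm{FP}^*, \mathrm{FN}^*) - \Psi(\mathrm{FP}, \mathrm{FN}) \le C\big(\alpha(\mathrm{FP} - \mathrm{FP}^*) + (1 - \alpha)(\mathrm{FN} - \mathrm{FN}^*)\big),$$

where:

$$C = \frac{1}{\gamma}\big(\Psi(\mathrm{FP}^*, \mathrm{FN}^*)(b_1 + b_2) - (a_1 + a_2)\big), \qquad \alpha = \frac{\Psi(\mathrm{FP}^*, \mathrm{FN}^*)\,b_1 - a_1}{\Psi(\mathrm{FP}^*, \mathrm{FN}^*)(b_1 + b_2) - (a_1 + a_2)}.$$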

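Footnote: The fact that a single threshold is sufficient for consistency of micro-averaged performance measures was already noticed by Koyejo et al (2015).

Now, we take: $\mathrm{FP}^* = \overline{\mathrm{FP}}(h^*_\Psi)$, $\mathrm{FN}^* = \overline{\mathrm{FN}}(h^*_\Psi)$, $\mathrm{FP} = \overline{\mathrm{FP}}(h)$ and $\mathrm{FN} = \overline{\mathrm{FN}}(h)$ for some $h$. Hence, (12) is clearly satisfied, as its left-hand side is just the micro-regret $\mathrm{Reg}_{\Psi_{\mathrm{micro}}}(h)$. This means that for any multilabel classifier $h$:

$$\begin{aligned}
\mathrm{Reg}_{\Psi_{\mathrm{micro}}}(h) &\le C\big(\alpha(\overline{\mathrm{FP}}(h) - \overline{\mathrm{FP}}(h^*_\Psi)) + (1 - \alpha)(\overline{\mathrm{FN}}(h) - \overline{\mathrm{FN}}(h^*_\Psi))\big) \\
&= \frac{C}{m} \sum_{i=1}^m \Big(\alpha(\mathrm{FP}_i(h_i) - \mathrm{FP}_i(h^*_{\Psi,i})) + (1 - \alpha)(\mathrm{FN}_i(h_i) - \mathrm{FN}_i(h^*_{\Psi,i}))\Big) \\
&= \frac{C}{m} \sum_{i=1}^m \big(\mathrm{Risk}^i_\alpha(h_i) - \mathrm{Risk}^i_\alpha(h^*_{\Psi,i})\big) \\
&\le \frac{C}{m} \sum_{i=1}^m \mathrm{Reg}^i_\alpha(h_i),
\end{aligned}$$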

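where $\mathrm{Risk}^i_\alpha(h_i)$ and $\mathrm{Reg}^i_\alpha(h_i)$ are the cost-sensitive risk and the cost-sensitive regret defined with respect to label $y_i$:

$$\mathrm{Risk}^i_\alpha(h_i) = \mathbb{E}_{(x, y_i)}\big[\ell_\alpha(y_i, h_i(x))\big], \qquad \mathrm{Reg}^i_\alpha(h_i) = \mathrm{Risk}^i_\alpha(h_i) - \min_h \mathrm{Risk}^i_\alpha(h).$$

If we now take $h_i = h_{f_i,\theta^*}$, where $\theta^* = \psi(\alpha)$, $\psi$ being the link function of the loss, Proposition 5 (applied for each $i = 1, \ldots, m$ separately) implies:

$$\mathrm{Reg}^i_\alpha(h_{f_i,\theta^*}) \le \sqrt{\frac{2}{\lambda}}\,\sqrt{\mathrm{Reg}^i_\ell(f_i)}.$$

Together, this gives:

$$\mathrm{Reg}_{\Psi_{\mathrm{micro}}}(h_{f,\theta^*}) \le C \sqrt{\frac{2}{\lambda}}\,\frac{1}{m} \sum_{i=1}^m \sqrt{\mathrm{Reg}^i_\ell(f_i)}.$$

The theorem now follows by noticing that:

$$\theta^*_f = \operatorname*{argmax}_\theta \Psi_{\mathrm{micro}}(h_{f,\theta}) = \operatorname*{argmin}_\theta \mathrm{Reg}_{\Psi_{\mathrm{micro}}}(h_{f,\theta}),$$

and thus $\mathrm{Reg}_{\Psi_{\mathrm{micro}}}(h_{f,\theta^*_f}) \le \mathrm{Reg}_{\Psi_{\mathrm{micro}}}(h_{f,\theta^*})$. □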

Theorem 7 suggests a decomposition into $m$ independent binary classification problems, one for each
label $y_1, \ldots, y_m$, and training $m$ real-valued classifiers $f_1, \ldots, f_m$ with small $\ell$-regret on the corresponding
label. Then, however, contrary to macro-averaging, a single threshold, shared among all labels, is tuned by
optimizing $\Psi_{\mathrm{micro}}$ on a separate validation sample.
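The shared-threshold step for micro-averaging can be sketched in the same style, again for the F-measure. The sketch assumes that the positive-class rate entering the F-measure is averaged over labels together with FP and FN, which yields the usual pooled micro-F; the scores are taken as given and the helper names are hypothetical.

```python
def avg_rates(score_cols, label_cols, theta):
    # FP and FN rates averaged over labels, with every label thresholded
    # at the same theta (predict +1 when the score is >= theta).
    m = len(score_cols)
    fp_bar = 0.0
    fn_bar = 0.0
    for scores, labels in zip(score_cols, label_cols):
        n = len(labels)
        fp_bar += sum(1 for s, y in zip(scores, labels) if s >= theta and y == -1) / n
        fn_bar += sum(1 for s, y in zip(scores, labels) if s < theta and y == 1) / n
    return fp_bar / m, fn_bar / m

def micro_f(score_cols, label_cols, theta):
    # F-measure evaluated on label-averaged FP, FN and positive rate (assumption:
    # the positive rate is averaged over labels as well).
    p_bar = sum(
        sum(1 for y in labels if y == 1) / len(labels) for labels in label_cols
    ) / len(label_cols)
    fp_bar, fn_bar = avg_rates(score_cols, label_cols, theta)
    denom = 2 * p_bar + fp_bar - fn_bar
    return 2 * (p_bar - fn_bar) / denom if denom > 0 else 0.0

def tune_shared_threshold(score_cols, label_cols):
    # Pool the scores of all labels and search for one common threshold.
    pooled = sorted({s for scores in score_cols for s in scores})
    candidates = pooled + [pooled[-1] + 1.0]
    return max(candidates, key=lambda t: micro_f(score_cols, label_cols, t))

# Toy validation data: two labels, four instances each.
score_cols = [[0.9, 0.8, 0.2, 0.1], [0.3, 0.6, 0.7, 0.4]]
label_cols = [[1, 1, -1, -1], [-1, 1, 1, -1]]
theta = tune_shared_threshold(score_cols, label_cols)
print(theta, micro_f(score_cols, label_cols, theta))
```

Compared with the macro-averaged case, the only change is that one threshold is searched over the pooled scores instead of one threshold per label.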




6 Empirical results 

We perform experiments on synthetic and benchmark data to empirically study the two-step procedure
analyzed in the previous sections. To this end, we minimize a surrogate loss in the first step to obtain a
real-valued function $f$, and in the second step, we tune a threshold $\theta$ on a separate validation set to optimize
a given performance metric. We use logistic loss in this procedure as a surrogate loss. Recall that logistic
loss is 4-strongly proper composite (see Table 2). We compare its performance with hinge loss, which is not
even a proper composite loss. As our task performance metrics, we take the F-measure ($F_\beta$-measure
with $\beta = 1$) and the AM measure (which is a special case of Weighted Accuracy with weights $w_1 = P$ and
$w_2 = 1 - P$). We could also use the Jaccard similarity coefficient; it turns out, however, that the threshold
optimized for the F-measure coincides with the optimal threshold for the Jaccard similarity coefficient (this
is because the Jaccard similarity coefficient is strictly monotonic in the F-measure and vice versa), so the
latter measure does not give anything substantially different from the F-measure.
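The monotone relation between the two metrics can be made explicit. Writing $F = 2\mathrm{TP}/(2\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$ and $J = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$, a short calculation gives $J = F/(2 - F)$, which is strictly increasing on $[0, 1]$. The small check below uses hypothetical function names and arbitrary counts.

```python
def f_measure(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN)
    return 2 * tp / (2 * tp + fp + fn)

def jaccard(tp, fp, fn):
    # J = TP / (TP + FP + FN)
    return tp / (tp + fp + fn)

def jaccard_from_f(f):
    # J = F / (2 - F), strictly increasing in F on [0, 1].
    return f / (2 - f)

# Arbitrary counts: both routes give the same Jaccard value.
print(jaccard(3, 1, 2), jaccard_from_f(f_measure(3, 1, 2)))
```

Because the map $F \mapsto F/(2 - F)$ preserves order, any threshold maximizing one metric also maximizes the other.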

The experiments on benchmark data are split into two parts. The first part concerns binary classification
problems, while the second part concerns multi-label classification.

The purpose of this study is not to compare the two-step approach with alternative methods; this
has already been done in the previous work on the subject, see, e.g., (Nan et al, 2012; Parambath et al,
2014). We also note that similar experiments have been performed in the cited papers on the statistical
consistency of generalized performance metrics (Koyejo et al, 2014; Narasimhan et al, 2014; Parambath
et al, 2014; Koyejo et al, 2015). Therefore, we unavoidably repeat some of the results obtained therein,
but the main novelty of the experiments reported here is that we emphasize the difference between strongly
proper composite losses and non-proper losses.

6.1 Synthetic data 

We performed two experiments on synthetic data. The first experiment deals with a discrete domain in 
which we learn within a class of all possible classifiers. The second experiment concerns continuous domain 
in which we learn within a restricted class of linear functions. 

First experiment. We let the input domain $X$ be a finite set consisting of 25 elements, $X = \{1, 2, \ldots, 25\}$, and take $\Pr(x)$ to be uniform over $X$, i.e. $\Pr(x = i) = 1/25$. For each $x \in X$, we randomly draw a value of $\eta(x)$ from the uniform distribution on the interval $[0, 1]$. In the first step, we take an algorithm which minimizes a given surrogate loss $\ell$ within the class of all functions $f \colon X \to \mathbb{R}$. Hence, given the training data of size $n$, the algorithm computes the empirical minimizer of surrogate loss $\ell$ independently for each $x$. As surrogate losses, we use logistic and hinge loss. In the second step, we tune the threshold $\theta$ on a separate validation set, also of size $n$. For each $n$, we repeat the procedure 100,000 times, averaging over samples and over models (different random choices of $\eta(x)$). We start with $n = 100$ and increase the number of training examples up to $n = 10{,}000$. The $\ell$-regret and $\Psi$-regret can be easily computed, as the distribution is known and $X$ is discrete.

The results are given in Fig. 1. The $\ell$-regret goes down to zero for both surrogate losses, which is expected, since this is the objective function minimized by the algorithm. Minimization of logistic loss (left plot) gives vanishing $\Psi$-regret for both the F-measure and the AM measure, as predicted by Theorem 2. In contrast, minimization of the hinge loss (right plot) is suboptimal for both task metrics and gives non-zero $\Psi$-regret even in the limit $n \to \infty$. This behavior can easily be explained by the fact that hinge loss is not a proper (composite) loss: the risk minimizer for hinge loss is given by $f^*_\ell(x) = \mathrm{sgn}(\eta(x) - 1/2)$ (Bartlett et al, 2006). Hence, the hinge loss minimizer is already a threshold function on $\eta(x)$, with the threshold value set to 1/2. If, for a given performance metric $\Psi$, the optimal threshold $\theta^*$ is different from 1/2, the hinge loss minimizer will necessarily have suboptimal $\Psi$-risk. This is clearly visible for the F-measure. The better result on the AM measure is explained by the fact that the average optimal threshold over all models is 0.5 for this measure, so the minimizer of hinge loss is not that far from the minimizer of the AM measure.
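The effect of a fixed threshold of 1/2 on the F-measure can be reproduced on a tiny discrete domain where $\eta$ is known. The sketch below is not the experiment from the paper: it uses five points with hand-picked values of $\eta$ (an assumption made for illustration), computes the population F-measure of classifiers that threshold $\eta$, and compares the best threshold with 1/2.

```python
def population_f(eta, theta):
    # X is uniform over len(eta) points; the classifier predicts +1
    # wherever eta[x] >= theta. TP, FP, FN are population quantities.
    n = len(eta)
    tp = sum(e for e in eta if e >= theta) / n
    fp = sum(1 - e for e in eta if e >= theta) / n
    fn = sum(e for e in eta if e < theta) / n
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def best_threshold(eta):
    # Search over the distinct values of eta, which covers every
    # nonempty set of positive predictions obtainable by thresholding eta.
    candidates = sorted(set(eta))
    return max(candidates, key=lambda t: population_f(eta, t))

# Hand-picked conditional probabilities (illustrative assumption).
eta = [0.1, 0.2, 0.3, 0.4, 0.6]
theta_star = best_threshold(eta)
print(theta_star, population_f(eta, theta_star), population_f(eta, 0.5))
```

On this example the best threshold lies well below 1/2, so a classifier that thresholds $\eta$ at 1/2, as the hinge loss minimizer does, attains a strictly smaller F-measure.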

Second experiment. We take $X = \mathbb{R}^2$ and generate $x \in X$ from a standard Gaussian distribution. We use a logistic model of the form $\eta(x) = \frac{1}{1 + \exp(-a_0 - a^\top x)}$. The weights $a = (a_1, a_2)$ and $a_0$ are also drawn from a standard Gaussian. For a given model (set of weights), we take training sets of increasing size from $n = 100$ up to $n = 3000$, using 20 different sets for each $n$. We also generate one test set of size 100,000. For each $n$, we use 2/3 of the training data to learn a linear model $f(x) = w_0 + w^\top x$, using either support vector machines (SVM, with linear kernel) or logistic regression (LR). We use the implementation of these algorithms from the LibLinear package (Fan et al, 2008). The remaining 1/3 of the training data is used for tuning the threshold. We average the results over 20 different models.

Footnote: Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear

Figure 1: Regret (averaged over 100,000 repetitions) on the discrete synthetic model as a function of the number of training examples. Left panel: logistic loss is used as a surrogate loss. Right panel: hinge loss is used as a surrogate loss.

Figure 2: Regret (averaged over $20 \times 20 = 400$ repetitions) on the logistic model as a function of the number of training examples. Left panel: regret with respect to the F-measure and surrogate losses. Right panel: regret with respect to the AM measure and surrogate losses.

The results are given in Fig. 2. As before, we plot the average $\ell$-regret for logistic and hinge loss, and the $\Psi$-regret for the F-measure and the AM measure. The results obtained for LR (logistic loss minimizer) agree with our theoretical analysis: the $\ell$-regret and the $\Psi$-regret with respect to both the F-measure and the AM measure go to zero. This is expected, as the data generating model is a linear logistic model (so that the risk minimizer for logistic loss is a linear function), and thus belongs to the class of functions over which we optimize. The situation is different for SVM (hinge loss minimizer). Firstly, the $\ell$-regret for hinge loss does not converge to zero. This is because the risk minimizer for hinge loss is a threshold function $\mathrm{sgn}(\eta(x) - 1/2)$, and it is not possible to approximate such a function with a linear model $f(x) = w_0 + w^\top x$. Hence, even when $n \to \infty$, the empirical hinge loss minimizer (SVM) does not converge to the risk minimizer. This behavior, however, can be advantageous for SVM in terms of the task performance measures. This is because the risk minimizer for hinge loss, a threshold function on $\eta(x)$ with the threshold value 1/2, will perform poorly, for example, in terms of the F-measure and the AM measure, for which the optimal threshold $\theta^*$ is usually very different from 1/2. In turn, the linear model constraint will prevent convergence to the risk minimizer, and the resulting linear function $f(x) = w_0 + w^\top x$ will often be close to some invertible function of $\eta(x)$; hence after tuning the threshold, we will often end up close to the minimizer of a given task performance measure. This is seen for the F-measure on the left panel in Fig. 2. In this case, the F-regret of SVM gets quite close to zero, but is still worse than that of LR. The non-vanishing regret is mainly caused by the fact that for some models with imbalanced class priors, SVM reduces the weights $w$ to zero and sets the intercept $w_0$ to 1 or $-1$, predicting the same value for all $x \in X$ (this is not caused by a software problem; it is how the empirical loss minimizer behaves). Interestingly, the F-measure is only slightly affected by this pathological behavior of the empirical hinge loss minimizer. In turn, the AM measure, for which the plots are drawn in the right panel in Fig. 2, is not robust against this behavior of SVM: predicting the majority class actually results in the value of the AM measure equal to 1/2, a very poor performance, which is on the same level as a random classifier.
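The statement that a constant (majority-class) prediction yields an AM measure of 1/2 follows directly from the definition of AM as the average of the true positive rate and the true negative rate. The check below uses hypothetical counts for an imbalanced sample and assumes both classes are present (otherwise the rates are undefined).

```python
def am_measure(tp, fp, tn, fn):
    # AM = (TPR + TNR) / 2; requires at least one positive and one negative.
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2

# 90 examples of one class and 10 of the other (hypothetical counts).
all_positive = am_measure(tp=90, fp=10, tn=0, fn=0)   # always predict +1
all_negative = am_measure(tp=0, fp=0, tn=90, fn=10)   # always predict -1
print(all_positive, all_negative)
```

Whichever class the constant classifier predicts, one of the two rates is 1 and the other is 0, so AM equals 1/2 regardless of the class imbalance.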
Figure 3: Average test set performance on benchmark data sets as a function of the number of training
examples. Left panel: the covtype dataset. Right panel: the gisette dataset. The top plots show logistic and
hinge loss, the center plots show the F-measure, the bottom plots show the AM measure.


6.2 Benchmark data for binary classification 

The next experiment is performed on two binary benchmark datasets, described in Table 4. We randomly take out a test set of size 181,012 for covtype, and of size 3,000 for gisette. We use the remaining examples for training. As before, we incrementally increase the size of the training set. We use 2/3 of the training examples for learning a linear model with SVM or LR, and the rest for tuning the threshold. We repeat the experiment (random train/validation/test split) 20 times. The results are plotted in Fig. 3. Since the data distribution is unknown, we are unable to compute the risk minimizers, hence we plot the average loss/metric on the test set rather than the regret. The results show that SVM performs better on the covtype dataset, while LR performs better on the gisette dataset. However, there is very little difference in performance of SVM and LR in terms of the F-measure and the AM measure on these data sets. We suspect this is due to the fact that the function $\eta(x)$ is very different from linear for these problems, so that neither LR nor SVM converges to the $\ell$-risk minimizer, and Theorem 2 does not apply. Further studies would be required to understand the behavior of surrogate losses in this case.

Footnote: Datasets are taken from the LibSVM repository: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

dataset    # examples   # features
covtype    581,012      54
gisette    7,000        5,000

Table 4: Basic statistics for binary classification benchmark datasets

dataset     # labels   # training examples   # test examples   # features
scene       6          1211                  1169              294
yeast       14         1500                  917               103
mediamill   101        30993                 12914             120

Table 5: Basic statistics for multi-label benchmark data sets

6.3 Benchmark data for multi-label classification 

In the last experiment we use three multi-label benchmark data sets. Table 5 provides a summary of basic
statistics of these datasets. The aim of the experiment is to verify the theoretical results in Section 5 on 
learning the micro- and macro-averaged performance metrics. We use the F-measure and the AM-measure 
as in previous experiments. 

The data sets are already split into the training and testing parts. As before we train a linear model 
using either SVM or LR on 2/3 of training examples. The rest of training data is used for tuning the 
threshold. For optimizing macro-averaged measures, we tune the threshold separately for each label. This 
approach agrees with our analysis given in Section 5.1. For micro-averaging, we tune a common threshold 
for all labels: we simply collect predictions for all labels and find the best threshold using these values. 
This approach is justified by the theoretical analysis in Section 5.2. Hence, the only difference between 
micro- and macro-versions of the algorithms is whether a single or multiple thresholds are tuned. In total we 
use 8 algorithms: two learning algorithms (LR/SVM), two performance measures (F/AM), and two types 
of averaging (Macro/Micro). Note that our experiments include evaluating algorithms tuned for macro¬ 
averaging in terms of micro-averaged metrics, and vice versa. The goal of such cross-analysis is to determine 
the impact of threshold sharing for both averaging schemes. As before, we incrementally increase the size 
of the training set and repeat training and threshold tuning 20 times (we use random draws of training 
instances into the proper training and the validation parts; the test set is always the same, as originally 
specified for each data set). The results are given in Fig. 4.

The plots generally agree with the conclusions coming from the theoretical analysis, with some intriguing 
exceptions, however. As expected, LR tuned for a given performance metric gets the best result with respect 
to that metric in most of the cases. For the scene data set, however, the methods tuned for the micro- 
averaged metrics (single threshold shared among labels) outperform the ones tuned for macro-averaged 
metrics (separate thresholds tuned for each label), even when evaluated in terms of macro-averaged metrics. 
A similar result has been obtained by Koyejo et al (2015). It seems that tuning a single threshold shared
among all labels can lead to a more stable solution that is less prone to overfitting, even though it is not the 
optimal thing to do for macro-averaged measures. We further report that, interestingly, SVM outperform 
LR in terms of Macro-F on mediamill and this is the only case in which SVM get a better result than LR. 

Footnote: Datasets are taken from the LibSVM repository: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

Figure 4: Average test set performance on benchmark data sets for multi-label classification as a function
of the number of training examples. Macro- and micro-averaged F-measure and AM are plotted for LR and
SVM tuned for all the measures.

7 Summary

We present a theoretical analysis of a two-step approach to optimize classification performance metrics, which
first learns a real-valued function $f$ on a training sample by minimizing a surrogate loss, and then tunes
the threshold on $f$ by optimizing the target performance metric on a separate validation sample. We show
that if the metric is a linear-fractional function, and the surrogate loss is strongly proper composite, then
the regret of the resulting classifier (obtained from thresholding the real-valued $f$) measured with respect to the
target metric is upper bounded by the regret of $f$ measured with respect to the surrogate loss. The proof of
our result goes by an intermediate bound of the regret with respect to the target measure by a cost-sensitive
classification regret. As a byproduct, we get a bound on the cost-sensitive classification regret by a surrogate
regret of a real-valued function which holds simultaneously for all misclassification costs. We also extend our
results to cover multilabel classification and provide regret bounds for micro- and macro-averaging measures.
Our findings are backed up in a computational study on both synthetic and real data sets.